1. Foreword

Performance  and  reliability  are  major  concerns  in  the  design  of  large  disk  arrays.  Hellerstein 
et  al.  pioneered  the  study  of  erasure-resilient  codes  that  allow  one  to  reconstruct  data 
without  loss  in  the  presence  of  disk  failures.  Chee,  Colbourn,  and  Ling  used  the  close 
connection  between  erasure-resilient  codes  and  certain  combinatorial  designs  to  establish 
much  improved  asymptotic  and  exact  existence  results  for  these  codes.  The  design-theoretic 
approach  provided  the  scientific  basis  for  the  project. 

In  the  subsequent  sections,  we  first  provide  the  relevant  background  on  the  design  of 
erasure  codes  for  RAID,  contrasting  these  with  the  more  extensively  studied  erasure  codes 
for  digital  communications.  Then  we  summarize  highlights  of  the  research  in  the  ARO 
project  now  completed.  Our  research  effort  on  codes  for  disk  arrays  revealed  an  unexpected 
means  of  optimizing  I/O  performance  through  appropriate  orderings  of  codewords.  Indeed 
our  simulation  results  show  a  marked  improvement  in  performance  when  codewords  are 
ordered  such  that  consecutive  sets  of  codewords  exhibit  a  maximum  overlap.  We  undertook 
an  investigation  of  optimal  orderings  for  triple  erasure  codes,  and  obtained  substantial  results 
on  orderings  for  double  erasure  codes. 

4. Statement of the Problem Studied

Over the last decade, there has been a sustained exponential advance in the density and
performance of semiconductor technology. With this progress have come faster microprocessors
as well as larger and faster primary memory devices. Improvements in secondary storage
systems, on the other hand, have not kept pace. While the performance of RISC
microprocessors has been increasing by more than 50% per year [42], disk transfer rates, which
depend on the speed of mechanical movements and magnetic media densities, have only improved
by about 20% each year [10]. This phenomenon has transformed many computationally-bound
applications into I/O-bound applications.

Amdahl [2] predicted three decades ago that unless accompanied by corresponding
increases in secondary storage performance, big increases in microprocessor performance can
only bring about marginal improvements in overall system performance. This disparity has
led  to  the  consideration  of  parallelism  as  a  means  to  speed  up  secondary  storage  systems. 
Several  ideas  have  been  proposed  as  to  how  parallelism  can  be  exploited.  The  most  important 
and  successful  is  the  disk  array  architecture. 

The  disk  array  architecture  organizes  many  independent  small  disks  into  one  large  logical 
disk.  Small  disks  are  preferable  to  large  ones  because  they  have  a  lower  cost  and  consume 
less  power.  For  improved  performance,  disk  arrays  employ  the  concept  of  data  striping  [45], 
which  spreads  data  across  multiple  disks.  This  allows  both  single  and  multiple  I/O  requests 
to  be  processed  in  parallel  by  separate  disks,  thus  improving  effective  transfer  rates.  A 
further  advantage  of  disk  striping  is  uniform  load  balance. 

The more disks we have in a disk array, the higher the performance we obtain.
Unfortunately, large disk arrays have a low probability of having all disks functional. Failures in
disk arrays are often assumed to satisfy the memoryless property, that is, the life expectancy
of a disk is dependent only upon the condition that the disk is working now. Under this
assumption, the reliability of a disk array is modeled by the exponential distribution [29]. As a
consequence, for low disk failure rates, the failure rate of a disk array is directly proportional
to the number of disks it contains. Many applications, notably database and transaction
processing systems, require both high throughput and high data availability of their storage
systems. The most demanding of these applications require continuous operation, which in
terms of a storage system requires (i) the ability to satisfy all requests for data even in the
presence of disk failures, and (ii) the ability to reconstruct the content of a failed disk onto a
replacement disk, thereby restoring itself to a fault-free state. These requirements strongly
encourage the introduction of redundancy to tolerate disk failures. Disk arrays that
incorporate redundancy have come to be known as Redundant Arrays of Independent Disks
(RAID).
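
The proportionality between array failure rate and array size can be illustrated with a
short calculation. The sketch below assumes independent disks with identical exponential
lifetimes; the independence assumption and the figure of 200,000 hours for the mean time
to failure of one disk are illustrative choices, not values taken from this report.

```python
import math

def array_failure_rate(disk_failure_rate: float, num_disks: int) -> float:
    """Rate of the first failure among num_disks independent exponential disks."""
    return num_disks * disk_failure_rate

def prob_all_disks_working(disk_failure_rate: float, num_disks: int, hours: float) -> float:
    """Probability that no disk has failed after the given number of hours."""
    return math.exp(-array_failure_rate(disk_failure_rate, num_disks) * hours)

# Hypothetical per-disk mean time to failure (an assumption for illustration only).
DISK_MTTF_HOURS = 200_000.0
rate = 1.0 / DISK_MTTF_HOURS

for n in (1, 10, 100):
    array_mttf = 1.0 / array_failure_rate(rate, n)
    one_year = prob_all_disks_working(rate, n, 24 * 365)
    print(f"{n:4d} disks: array MTTF {array_mttf:10.0f} h, "
          f"P(no failure in one year) = {one_year:.4f}")
```

Under these assumptions the mean time to the first failure shrinks by a factor equal to
the number of disks, which is the motivation for adding redundancy.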

There are three primary types of disk failures. Transient errors arise from noise
corruption and are dealt with by repeating the requests. Media defects are caused by permanent
defects  in  material,  and  are  detected  and  masked  by  the  manufacturer.  Catastrophic  failures 
include  head  crashes  and  failures  of  the  disk  controller  electronics.  When  a  disk  suffers  a 
catastrophic  failure,  its  data  is  rendered  unreadable,  and  is  effectively  erased.  We  therefore 
call  such  a  disk  failure  an  erasure.  For  convenience,  we  also  call  a  set  of  k  disk  failures  a 
k-erasure.  Error-correcting  codes  can  be  used  to  tolerate  erasures.  However,  components  in 
disk  arrays  allow  us  to  determine  exactly  where  erasures  have  occurred.  It  is  possible  to  take 
advantage  of  this  additional  information  to  derive  codes  that  are  better  than  those  based  on 
error-correcting  codes. 

Elias  [28]  apparently  was  the  first  to  distinguish  between  erasures  and  errors,  and  to 
develop  a  model  of  the  erasure  channel.  Rabin  [43]  investigated  erasure-resilient  codes  for 
information  dispersal.  The  intended  application  in  Rabin’s  work  arises  when  losses  are 
frequent,  and  there  is  a  relatively  small  overhead  in  having  a  large  amount  of  redundant 
data. Alon et al. [1] also studied erasure-resilient codes to combat bursty losses in
packet-switched networks. The communications environment in which a substantial fraction of
packets  are  erased  (lost)  is  a  practical  model  of  the  Internet  as  currently  deployed.  For  this 
reason,  erasure  codes  that  deal  with  a  large  fraction  of  erasures  have  been  very  aggressively 
studied.  A  sample  of  the  most  relevant  papers  in  this  area  includes  [37,  38,  6,  46].  Coding 
files  for  broadcast  or  transmission  permits  both  large  scale  loss  and  high  levels  of  redundancy, 
and  involves  typically  a  very  large  number  of  information  and  redundant  (check)  packets.  In 
disk  arrays,  however,  one  finds  a  very  different  application  of  erasure  codes.  In  that  context, 
erasures are much rarer. More importantly, the sizes of the disk arrays involved are orders
of magnitude smaller than the number of packets in a file broadcast such as the Digital
Fountain  [6].  Hence  the  parameters  of  erasure  codes  of  interest  are  quite  different  in  these 
two  applications,  despite  similar  definitions  and  objectives. 

Hellerstein  et  al.  [32]  first  proposed  the  use  of  erasure-resilient  codes  for  large  disk  arrays. 
Chee,  Colbourn,  and  Ling  [8]  extended  their  work,  establishing  an  extensive  combinatorial 
framework  for  their  study.  By  interpreting  the  coding  problem  in  the  context  of  extremal  set 
theory,  we  have  already  obtained  new  classes  of  optimal  and  asymptotically  optimal  erasure- 
resilient  codes.  These  codes  improve  and  extend  previous  results  in  the  literature.  Our 
treatment  has  also  revealed  interesting  and  surprising  connections  to  combinatorial  design 
theory.  The  mathematical  study  of  erasure  codes,  especially  in  the  disk  array  context  when 
the  number  of  erasures  is  small,  lies  within  the  theory  of  error-correcting  codes  [40].  In  this 
direction,  relevant  research  concerns  low  density  parity  check  matrices;  see  [4,  5,  35,  49], 
especially  [4],  in  which  an  application  to  RAID  is  discussed. 

Hellerstein et al. [32] formulate the construction of erasure codes for disk arrays as
follows. A system of linear equations modulo 2 specifies the contents of c check or parity
disks as functions of the contents of n information or data disks. Often these are represented
in matrix form using a parity check matrix $H = [P \mid I]$. H is a $c \times (c + n)$ matrix
of 0s and 1s. P is a $c \times n$ matrix with rows indexed by check disks and columns by
information disks, so that the (i, j) entry is 1 if and only if information disk j is checked
by check disk i. I is a $c \times c$ identity matrix with rows and columns indexed by the
check disks.
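
As a small illustration of this matrix form, the following sketch assembles H from a given
P. The function name and the 3 x 3 example matrix are hypothetical and chosen only for
illustration; they do not come from [32].

```python
def parity_check_matrix(P):
    """Return H = [P | I] as a list of rows, where P is a c x n 0/1 matrix."""
    c = len(P)
    return [list(row) + [1 if i == j else 0 for j in range(c)]
            for i, row in enumerate(P)]

# Hypothetical example: c = 3 check disks, n = 3 information disks,
# each information disk (column) checked by exactly two check disks.
P_example = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]

H_example = parity_check_matrix(P_example)
for row in H_example:
    print(row)
```

Row i of the result lists every disk taking part in the parity equation of check disk i:
the information disks it checks, followed by the check disk itself.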

An [n, c, k]-erasure-resilient code, or briefly an [n, c, k]-ERC, consists of an encoding
algorithm $\mathcal{E}$ and a decoding algorithm $\mathcal{D}$ with the following properties.
Given an n-tuple S of stripes, $\mathcal{E}$ produces an (n + c)-tuple or codeword
$\mathcal{E}(S) = (\mathcal{E}_1(S), \ldots, \mathcal{E}_{n+c}(S))$ of stripes such that for
any $I \subseteq \{1, \ldots, n + c\}$, where $|I| = n + c - k$, the decoding algorithm
$\mathcal{D}$ is able to recover S from $(I, \{\mathcal{E}_i(S) \mid i \in I\})$. We often
call an [n, c, k]-ERC a k-ERC when the parameters n and c are not important in the context.

Erasure correction capability (reliability) is completely specified by the parity check
matrix: a set of disk failures can be corrected if and only if the corresponding set of columns
of H is linearly independent modulo 2 [32]. Aspects of performance are also specified by the
structure  of  this  matrix.  In  particular,  the  column  sums  of  P  specify  the  update  penalties, 
reflecting  the  cost  of  writing  redundant  information  when  a  data  disk  is  written.  The  row 
sums  of  H  specify  the  group  sizes,  indicating  the  cost  of  reconstructing  a  failed  disk.  The 
ratio  of  c  to  n  is  the  check  disk  overhead,  the  additional  cost  in  terms  of  number  of  disks 
which  we  incur  in  order  to  maintain  the  redundant  information.  More  precisely,  we  have: 

Check disk overhead: This is the ratio of the number of check disks to information disks.
An [n, c, k]-ERC has a check disk overhead of c/n.

Update  penalty:  This  is  the  number  of  check  disks  whose  content  must  be  changed  when 
an  update  is  made  in  the  content  of  a  given  information  disk.  We  call  these  disks 
the  disks  associated  with  the  information  disk.  If  m  check  disks  need  to  be  involved  in 
every  write,  then  the  parallelism  of  the  disk  array  is  reduced  by  a  factor  of  m+1.  Since 
parallelism  is  the  reason  behind  using  disk  arrays,  update  penalties  should  be  kept  as 
small as possible. The update penalties of an erasure-resilient code with parity-check
matrix $H = [P \mid I]$ are the column sums of P.

Group  size:  This  is  the  number  of  disks  that  must  be  accessed  during  the  reconstruction 
of  a  single  failed  disk.  The  cost  of  reconstruction  makes  small  group  size  desirable, 
while  for  load  balancing  reasons,  uniform  group  size  is  desirable.  The  group  sizes  of 
an  erasure-resilient  code  are  the  row  sums  of  its  parity-check  matrix. 

Since  updates  of  data  are  usually  much  more  frequent  than  the  reconstruction  of  data  lost 
in  erasures,  the  update  penalties  are  typically  of  more  concern  than  the  group  sizes. 
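
The three measures above, together with the linear independence test for correctability,
can be computed directly from P. The following is a minimal sketch; the function names and
the example matrix are hypothetical. Columns of H are packed into integers so that addition
modulo 2 becomes a bitwise exclusive or.

```python
from itertools import combinations

def check_disk_overhead(P):
    """Ratio c/n of check disks to information disks."""
    return len(P) / len(P[0])

def update_penalties(P):
    """Column sums of P: check disks touched when each information disk is written."""
    return [sum(row[j] for row in P) for j in range(len(P[0]))]

def group_sizes(P):
    """Row sums of H = [P | I]: each row of P plus the check disk itself."""
    return [sum(row) + 1 for row in P]

def column_of_H(P, disk):
    """Column of H for a disk, as a bitmask over rows; disks 0..n-1 are information disks."""
    c, n = len(P), len(P[0])
    if disk < n:
        return sum(P[i][disk] << i for i in range(c))
    return 1 << (disk - n)

def gf2_rank(vectors):
    """Rank over GF(2) of bitmask vectors, by elimination on the highest set bit."""
    pivots = {}
    for v in vectors:
        while v:
            high = v.bit_length() - 1
            if high not in pivots:
                pivots[high] = v
                break
            v ^= pivots[high]
    return len(pivots)

def correctable(P, failed_disks):
    """True if the columns of H for the failed disks are linearly independent mod 2."""
    cols = [column_of_H(P, d) for d in failed_disks]
    return gf2_rank(cols) == len(cols)

# Hypothetical example with c = 3 and n = 3.
P_example = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
print("overhead:", check_disk_overhead(P_example))
print("update penalties:", update_penalties(P_example))
print("group sizes:", group_sizes(P_example))
bad = [s for s in combinations(range(6), 3) if not correctable(P_example, s)]
print("uncorrectable 3-erasures:", bad)
```

In this example every pair of failures is correctable, while some triples are not, so the
hypothetical code is a 2-ERC but not a 3-ERC.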

For  performance  reasons,  the  erasure-resilient  codes  studied  are  assumed  to  satisfy  two 
conditions: 

1. We restrict ourselves to systematic codes. An [n, c, k]-ERC is systematic if
$\mathcal{E}_i(S) = S_i$ for $1 \le i \le n$, where $S = (S_1, \ldots, S_n)$. The stripes
$\mathcal{E}_i(S)$, for $n < i \le n + c$, are called checks. This means that the encoding
function leaves the data unmodified on some
disks.  This  property  is  desirable  to  avoid  read  penalties  associated  with  decoding  when 
there  are  no  disk  failures. 

2. We restrict ourselves to linear codes over the field of order $2^L$, where L is the bit-size
of a stripe. In this case, we interpret a stripe as an L-dimensional vector over the field
of order 2, and $\mathcal{E}$ is a linear function. Hence, computations used to encode a
stripe are restricted to component-wise modulo two arithmetic, that is, the parity (or
'exclusive or') operation $\oplus$. This restriction ensures that encodings and manipulations
can be performed efficiently.
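
The two conditions can be made concrete with a small encoder. This is a sketch only:
stripes are modeled as Python bytes objects of equal length, and the function names and the
example P are hypothetical. The codeword keeps the n data stripes unchanged (systematic) and
appends c check stripes computed with exclusive or alone (linear).

```python
def xor_stripes(a: bytes, b: bytes) -> bytes:
    """Component-wise exclusive or of two stripes of equal length."""
    if len(a) != len(b):
        raise ValueError("stripes must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))

def encode(P, data):
    """Return the (n + c)-tuple: the n data stripes followed by the c check stripes."""
    size = len(data[0])
    checks = []
    for row in P:
        parity = bytes(size)  # all-zero stripe
        for j, bit in enumerate(row):
            if bit:
                parity = xor_stripes(parity, data[j])
        checks.append(parity)
    return list(data) + checks

# Hypothetical example: three data stripes of two bytes each.
P_example = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
data = [b"\x0f\x01", b"\xf0\x02", b"\xaa\x04"]
codeword = encode(P_example, data)

# If data disk 0 is erased, check 0 (which covers data disks 0 and 1) recovers it.
recovered = xor_stripes(codeword[3], codeword[1])
print(recovered == data[0])
```

Because the first n stripes are the data itself, a fault-free read never touches the check
stripes.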


5.  Summary  of  the  Most  Important  Results 

To date, our research has concerned reliability of parity check matrices whose update
penalties are as small as possible, while still correcting for all sets of d or fewer erasures.
We have concentrated on cases when d = 3 or 4. Chee et al. [8] describe optimal codes for
correcting three or more erasures arising from Steiner triple systems. These are codes which
provide minimal update penalty, small group size and a reasonable check disk overhead. In
addition, anti-Pasch Steiner triple systems are a class of codes that provide higher
resilience; see [8].

In order to understand performance within a class of codes offering similar erasure
reliability, and to understand whether the added resilience of anti-Pasch codes affects
performance, we implemented a computer simulation.

RaidSim


RaidSim [33, 34] is a simulation program written at the University of California at Berkeley
[34]. Holland [33] extended it to include declustered parity and online reconstruction. The
RaidSim program models disk reads and writes and simulates the passage of time. We further
modified the version described in [33] and used it as our starting point for
experimentation.

Implementing  a  3-erasure  code  is  expensive  due  to  the  high  update  penalty  for  each  write. 
It  is  therefore  imperative  to  understand  how  various  factors  in  designing  these  codes  impact 
overall  performance.  Our  experiments  were  designed  to  gain  insight  into  what  these  factors 
might  be,  and  how  they  might  affect  the  behavior  of  3-erasure  codes.  The  first  issue  in 
mapping  and  layout  of  a  disk  array  is  that  of  modeling  reads  and  writes.  In  a  single  error 
correcting  disk  array  the  basic  unit  is  a  parity  stripe  or  group.  There  can  be  no  overlap 
of  disk  accesses.  One  check  disk  is  assigned  to  each  information  disk  in  each  group.  Two 
basic  types  of  writes  have  been  described  [33].  If  more  than  half  of  the  parity  stripe  is  being 
written  then  all  of  the  remaining  information  disks  in  the  stripe  are  pre-read,  the  parity  for 
the stripe is computed and the new data and parity are written [33]. If, however, the new
data covers less than half of the parity stripe, a read-modify-write is preferred [33]. In this
case the requested information disks and their associated check disks are pre-read, the new
parity is computed and then the new data and parity are written back out.
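
The choice between the two write types in a single-parity stripe can be sketched as a count
of disk accesses. The function below is a hypothetical illustration, not RaidSim code. The
text does not say what happens when exactly half of the stripe is written; treating that
case as a read-modify-write is an assumption of this sketch.

```python
def plan_stripe_write(data_disks_in_stripe: int, disks_written: int) -> dict:
    """Count pre-reads and writes for one write to a single-parity stripe."""
    if not 1 <= disks_written <= data_disks_in_stripe:
        raise ValueError("disks_written must be between 1 and the stripe width")
    if 2 * disks_written > data_disks_in_stripe:
        # More than half the stripe is written: pre-read the untouched data disks
        # and recompute the parity from scratch.
        kind = "large-write"
        reads = data_disks_in_stripe - disks_written
    else:
        # Otherwise pre-read the old data and the old parity (assumed for the
        # exactly-half case as well).
        kind = "read-modify-write"
        reads = disks_written + 1
    writes = disks_written + 1  # new data plus the single parity disk
    return {"kind": kind, "reads": reads, "writes": writes}

for written in range(1, 5):
    print(written, plan_stripe_write(4, written))
```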

In  a  multiple  erasure  code,  some  differences  arise.  The  basic  assumption  that  one  check 
disk  is  assigned  to  each  data  disk  is  no  longer  valid.  Multiple  data  disks  in  the  same  stripe 
may  use  the  same  check  disk.  This  changes  the  manner  in  which  a  write  must  be  performed. 
In  order  to  optimize  performance  it  is  necessary  to  avoid  the  unnecessary  reading  and  writing 
of  a  disk  more  than  once  within  a  single  parity  stripe  when  check  disks  coincide  during  an 
individual write. Indeed, in a small write, if multiple information disks are involved, it may
not be necessary to read and write t check disks for each data disk in a t-erasure code.
Although the minimum update penalty of a t-erasure code remains t [32, 36], the actual
penalty may be reduced by changing the order of columns in the parity check matrix. We
focus  our  attention  here  on  the  read-modify-write.  This  is  the  more  interesting  type  of  write, 
since  it  is  an  expensive  operation  in  a  triple  erasure  code  and  hence  provides  some  potential 
for  improvement. 

Every  individual  disk  write  in  a  three-erasure  configuration  could  involve  up  to  four  reads 
and  four  writes.  When  more  than  one  disk  in  a  stripe  is  being  written,  however,  the  number 
of  extra  reads  and  writes  per  information  disk  may  decrease  due  to  overlap  in  check  disks. 
A  diagram  of  how  read-modify  writes  might  occur  in  a  triple-erasure  code  can  be  seen  in 
Figure  1.  It  shows  a  scenario  in  which  data  is  striped  across  two  disks  with  one  check  disk  in 
common.  This  might  happen  in  a  case  where  the  two  triples  {1,2,3}  and  {1,4,5}  are  checking 
two  data  disks  used  in  the  same  write.  In  this  case  only  five  check  disks  are  read  and  written. 
(Check  disk  one  only  needs  to  be  read  and  written  once,  saving  two  disk  accesses.) 

Thus  the  various  orderings  of  the  columns  of  the  parity  check  matrix  may  reduce  the 
overall  number  of  reads  and  writes. 
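
The saving in the example with triples {1,2,3} and {1,4,5} can be checked with a short
count. This is a sketch under the assumption that each disk involved in a read-modify-write
costs one read and one write, and that a check disk shared by several written data disks is
accessed only once; the function name is hypothetical.

```python
def rmw_accesses(check_sets, coalesce=True):
    """Total disk accesses for a read-modify-write touching one data disk per check set.

    check_sets holds, for each written data disk, the set of check disks covering it.
    With coalesce=True, a check disk shared by several data disks is counted once.
    """
    data_disks = len(check_sets)
    if coalesce:
        check_disks = len(set().union(*check_sets))
    else:
        check_disks = sum(len(s) for s in check_sets)
    return 2 * (data_disks + check_disks)  # one read and one write per disk

triples = [{1, 2, 3}, {1, 4, 5}]
print("without sharing:", rmw_accesses(triples, coalesce=False))
print("with sharing:   ", rmw_accesses(triples, coalesce=True))
```

The difference between the two counts is the two accesses saved on check disk one.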

[Figure 1 diagram: READ-MODIFY WRITE - STEINER TRIPLE SYSTEMS. This is a scenario with 2
information disks in the same stripe that share one parity disk. Each of the information
disks is first read and written. Next the needed parity disks are read and buffered in
memory. Each parity in the buffer is XOR'ed with the old data and new data. The new
information is written back to each of the parity disks.]

Figure 1: Read Modify Writes in a Steiner Triple System


Simulation Workload   Percent   Operation   Size (KB)   Alignment (KB)
Read                  100%      Read        72          24
Write                 100%      Write       72          24
Mixed                 82%       Read        72          24
                      18%       Write       72          24

Table 1: Ordering Workload


Experiments 

The  performance  experiments  are  run  with  a  workload  shown  in  Table  1.  Approximately 
100  experiments  are  run  for  each  system.  Each  experiment  is  tested  in  fault  free  mode  and 
then  with  one,  two,  three,  four  and  five  simultaneous  failures.  All  disk  failures  occur  at  the 
start  of  the  experiment  and  persist  for  the  entire  test. 

We focus on Steiner triple systems of order 15. There are eighty non-isomorphic systems
of order 15. Six different systems were chosen with varying numbers of Pasch configurations.
System one and system eighty differ the most in the number of configurations and are
therefore the most different structurally. To make sure that we did not neglect other
differences in these codes, we chose four other systems.

Three  orderings  are  used  for  the  experiments;  these  are  discussed  in  more  detail  in  [14]. 
The  first  is  one  in  which  the  overlap  among  consecutive  blocks  is  maximized  for  reads  or 
writes  that  span  multiple  disks.  This  could  minimize  the  overall  number  of  disk  accesses 
in  a  write  which  spans  information  disks.  The  second  ordering  has  the  property  that  every 
set  of  three  consecutive  blocks  are  pairwise  disjoint.  The  last,  designed  for  comparison,  is 
randomly  permuted. 
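
To make the notion of overlap among consecutive blocks concrete, the sketch below builds one
Steiner triple system of order 15 with the classical Bose construction and measures, for
each window of three consecutive blocks, how many distinct check disks the window touches
(6 at most overlap, 9 when the blocks are pairwise disjoint). The greedy ordering is a
hypothetical heuristic for illustration only; it is not one of the orderings of [14], and
the system built here is not claimed to be one of the six used in the experiments.

```python
import random
from itertools import combinations

def bose_sts(t):
    """Classical Bose construction of a Steiner triple system of order 6t + 3."""
    q = 2 * t + 1
    half = t + 1  # inverse of 2 modulo q, since 2 * (t + 1) = q + 1

    def point(x, i):
        return x + q * i  # encode the pair (x, i) as an integer in 0 .. 3q - 1

    blocks = [frozenset(point(x, i) for i in range(3)) for x in range(q)]
    for x, y in combinations(range(q), 2):
        z = ((x + y) * half) % q  # idempotent commutative quasigroup product
        for i in range(3):
            blocks.append(frozenset({point(x, i), point(y, i), point(z, (i + 1) % 3)}))
    return blocks

def window_union_sizes(order, width=3):
    """Number of distinct check disks touched by each run of `width` consecutive blocks."""
    return [len(frozenset().union(*order[k:k + width]))
            for k in range(len(order) - width + 1)]

def greedy_overlap_order(blocks):
    """Hypothetical heuristic: append the block sharing most points with the last two."""
    remaining = list(blocks)
    order = [remaining.pop(0)]
    while remaining:
        recent = frozenset().union(*order[-2:])
        best = max(remaining, key=lambda b: len(b & recent))
        remaining.remove(best)
        order.append(best)
    return order

blocks = bose_sts(2)  # 35 triples on 15 points
shuffled = list(blocks)
random.Random(0).shuffle(shuffled)

for name, order in (("greedy", greedy_overlap_order(blocks)), ("random", shuffled)):
    sizes = window_union_sizes(order)
    print(f"{name}: mean check disks per 3-block window = {sum(sizes) / len(sizes):.2f}")
```

A lower mean indicates more sharing of check disks within a multi-disk write, which is the
property the first ordering aims for.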

[Figure 2 plot: Response Time For Straight Writes - STS(15), Ordering C; horizontal axis:
Number of Failures]

Figure 2: Write Workload Comparison - Ordering C


Performance  Results 

Interestingly,  the  structure  of  the  codes  does  not  appear  to  play  a  significant  role  in  the 
performance  of  these  arrays.  This  is  illustrated  by  Figure  2  which  shows  the  results  for  the 
permuted  ordering  under  a  workload  of  straight  writes.  These  simulations  do  not  appear  to 
distinguish  between  different  systems  of  order  15.  Similar  results  are  observed  using  the  other 
two  orderings.  The  structure  of  the  code,  however,  does  determine  the  erasure  correction 
capability of a code. Given that no performance differences are observed, anti-Pasch
Steiner triple systems may be more desirable in situations when higher reliability is desired
[8].

Consistently, in all of these experiments, the order of columns in the parity check matrix
most reliably predicts performance. The lowest response times appear using the first
ordering, labeled A. This is the ordering that attempts to maximize overlap among each group
of three consecutive information disks. The second ordering, B, gives the slowest response
time. This is the ordering that uses pairwise disjoint consecutive blocks. Response times
for the third ordering, C, lie between these two extremes. This is the permuted or random
ordering.

In  a  read-only  workload  there  is  no  apparent  difference  in  the  performance  of  the  various 
orderings  in  fault  free  mode.  This  is  expected  since  a  read  workload  does  not  access  the 
check  disks  (see  Figure  3).  However,  the  response  time  starts  to  diverge  as  the  number  of 
failed  disks  increases.  This  can  be  attributed  to  accesses  to  additional  disks  required  in 
reconstruction.  As  seen  in  Figure  4,  the  most  dramatic  difference  in  performance  among  the 
different  orderings  is  seen  in  a  straight  write  workload.  This  is  expected  since  writes  incur  a 
significant  update  penalty.  Although  these  account  for  only  a  small  portion  of  the  accesses 
in  a  mixed  workload,  the  overall  performance  pattern  still  follows  that  of  a  straight  write 
workload.  This  can  be  seen  in  Figure  5. 

[Figure 3 plot: Read Workload Comparison of Orderings for STS(15)]

Figure 3: Ordering Results - Straight Read Workload

[Figure 4 plot: Write Workload Comparison of Orderings for STS(15); Steiner 15 systems
combined, 630 experiments/data point]

Figure 4: Ordering Results - Straight Write Workload

The results of this experimentation led us to ask fundamental questions about ordering
in Steiner triple systems and their application to RAID arrays. However, the more exciting
possibility arises for double erasure codes. With current sizes of disk arrays, triple erasure
correction is rarely needed [32]. Double erasure correction is needed more often [41], and a
number of practical schemes have been suggested [4, 41]. The general framework in [8, 32]
encompasses  the  schemes  so  far  studied.  In  this  framework,  a  scheme  for  double  erasures 
having  all  update  penalties  equal  to  two  can  be  viewed  as  a  graph.  In  particular,  the  P 
matrix  is  the  incidence  matrix  of  a  graph  on  c  vertices  and  n  edges.  Then  the  reliability 
question  is  to  characterize  the  c-vertex  n-edge  graph  which  corrects  the  largest  number  of 
erasures  of  more  than  two  disks  (since  all  graphs  give  schemes  to  correct  all  sets  of  two 
or  fewer  erasures).  The  performance  question  is  to  determine,  among  those  graphs  which 
exhibit  acceptable  reliability,  how  to  represent  the  parity  check  matrix  (i.e.  how  to  order 
it)  to  optimize  user  response  time,  by  reducing  the  effective  update  penalty  for  small  writes. 
Both  questions  are  complicated  by  the  need  for  scalability,  the  requirement  that  the  disk 
array  be  expandable  to  meet  changing  needs  for  information  content.  Scalability  leads  to 
further  ‘ordering’  questions,  namely  the  order  in  which  disks  are  to  be  removed  from  or 
added to an existing disk array [32].
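
The reliability question for graph-based double erasure codes can be explored by brute
force on small cases. In the sketch below, check disks are vertices, information disks are
edges, and a set of failed disks is uncorrectable exactly when the corresponding columns of
H are linearly dependent modulo 2. The two five-vertex graphs and all function names are
hypothetical examples, not schemes from [4, 41].

```python
from itertools import combinations

def columns_of_H(num_vertices, edges):
    """Columns of H = [P | I] as bitmasks: one per edge, then one per vertex."""
    info = [(1 << u) | (1 << v) for u, v in edges]
    check = [1 << v for v in range(num_vertices)]
    return info + check

def gf2_independent(vectors):
    """True if the bitmask vectors are linearly independent over GF(2)."""
    pivots = {}
    for v in vectors:
        while v:
            high = v.bit_length() - 1
            if high not in pivots:
                pivots[high] = v
                break
            v ^= pivots[high]
        if v == 0:
            return False  # v reduced to zero, so it depends on earlier vectors
    return True

def count_uncorrectable(num_vertices, edges, k):
    """Number of k-subsets of disks whose simultaneous failure cannot be corrected."""
    cols = columns_of_H(num_vertices, edges)
    return sum(1 for subset in combinations(cols, k) if not gf2_independent(subset))

# Two hypothetical graphs with c = 5 vertices and n = 5 edges.
pentagon = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
triangle_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]

for name, graph in (("pentagon", pentagon), ("triangle with tail", triangle_with_tail)):
    print(name,
          "bad 2-erasures:", count_uncorrectable(5, graph, 2),
          "bad 3-erasures:", count_uncorrectable(5, graph, 3))
```

For three failures, the uncorrectable sets in such a code are an edge together with both of
its endpoints, or three edges forming a triangle, so the triangle-free graph does better
with the same number of disks.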

The  impact  of  optimizing  the  selection  of  H  for  reliability  is  well  understood  in  principle 
[32],  but  the  construction  of  highly  reliable  codes  is  not  well  developed  [8].  In  fact,  these 
constructions involve difficult open questions on combinatorial designs [30, 36]. The
substantial effect of the selection of the ordering of H on performance has become clear
[11, 14] as a result of the project research.

[Figure 5 plot: Mixed Workload Comparison of Orderings for STS(15); horizontal axis:
Number of Failures]

Figure 5: Ordering Results - Mixed Workload

In  addition  to  the  effort  described  above  which  addresses  the  main  focus  of  the  completed 
grant,  the  PI  has  been  very  active  in  the  areas  of  combinatorial  algorithms,  and  combinatorial 
designs.  In  ARO  Project  DAAG-98-1-0272,  “Reliability  of  Large  Scale  Disk  Arrays”,  we 
have: 

1.  Implemented  a  simulation  environment  for  the  study  of  reliability  of  erasure  codes  for 
the  correction  of  three  simultaneous  erasures  [11,  13]. 

2.  Extended  computational  search  techniques  for  generating  optimal  codes  to  correct  for 
three  or  four  erasures  [15,  8,  39]. 

3.  Generalized  a  classical  construction  technique  (the  Bose  method)  for  producing  triple 
erasure  codes  [24,  44]. 

4. Developed the essential mathematical machinery [36] for a complete solution to Erdős's
problem [30] leading to an existence result for triple erasure codes.

5.  Obtained  preliminary  results  on  structural  aspects  of  erasure  codes  and  their  RAID 
implementation  [11,  14]. 

Concurrently,  the  methods  developed  in  this  research  have  found  broad  application  to  many 
domains: 

1.  Multiple  access  communications  for  MT-MFSK  signaling  [26,  50,  51]. 

2.  Design  of  bus  networks  [17]. 

3.  Superimposed  codes  for  nonadaptive  group  testing,  for  satellite  reservation  systems 
[7,  9,  16]  and  for  frameproof  codes  [47,  48]. 

4.  Quorum  systems  for  mutual  exclusion  [21,  22]. 

5. Ring grooming in SONET networks to minimize electrical/optical conversion [25].

General  themes  underlying  these  applications  are  discussed  in  [20],  and  specific  issues  are 
addressed  in  [18]. 